Homework 2

Due: 11:59PM Eastern September 28th

Submit via a direct message to the TA on Slack.

In this homework we'll explore decision trees and overfitting, and learn about the right way to evaluate the performance of a classifier.

The cell below imports the Python packages that you'll need, including scikit-learn, which contains implementations of many learning algorithms and supporting infrastructure.
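The original import cell is not reproduced here; a plausible set of imports for this homework might look like the following (the exact list is an assumption):

```python
# Numerics, plotting, and scikit-learn's tree learner plus evaluation helpers.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```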

The cell below implements a simple dataset generator that we'll use to explore the impact of various features of datasets that may lead to overfitting.
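The generator cell itself is not shown; a minimal sketch of what such a generator might look like is below. The function name `make_dataset`, the particular boolean expression, and the default feature count are all assumptions for illustration, not the notebook's actual code:

```python
import numpy as np

def make_dataset(n_instances, n_features=10, class_noise=0.0, rng=None):
    """Hypothetical stand-in for the notebook's generator: random boolean
    features, labels assigned by a fixed boolean expression, then flipped
    with probability class_noise."""
    rng = np.random.default_rng(rng)
    X = rng.integers(0, 2, size=(n_instances, n_features))
    # Example target concept (an assumption): (x0 AND x1) OR x2
    y = (X[:, 0] & X[:, 1]) | X[:, 2]
    # Class noise: flip each label independently with the given probability
    flip = rng.random(n_instances) < class_noise
    y = np.where(flip, 1 - y, y)
    return X, y
```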

When evaluating the accuracy of a classifier, the right approach is to measure accuracy on a test set of instances that were not used to train it. The train_test_split() function in scikit-learn makes it easy to create training and testing sets. Below is an example that shows overfitting, as evidenced by higher accuracy on the training set than on the testing set.
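The example cell itself is not reproduced here; a minimal sketch of the idea, assuming a noisy boolean dataset (the target concept and the 20% noise rate are assumptions), might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Boolean features with labels from an assumed expression, plus 20% class noise
X = rng.integers(0, 2, size=(1000, 10))
y = (X[:, 0] & X[:, 1]) | X[:, 2]
flip = rng.random(len(y)) < 0.2
y = np.where(flip, 1 - y, y)

# Hold out 30% of the instances for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
# With noisy labels an unpruned tree memorizes the training set, so
# train_acc typically exceeds test_acc -- the signature of overfitting.
```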

Note that if the training set has 0% class noise, we get a perfect tree. Spend some time convincing yourself that the tree below captures the boolean expression that assigns class labels.
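One way to inspect the learned tree and check that it captures the labeling expression is export_text(). The sketch below uses an assumed target concept, (x0 AND x1) OR x2, which is not necessarily the notebook's:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10))
y = (X[:, 0] & X[:, 1]) | X[:, 2]   # 0% class noise: labels are exact

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(clf, feature_names=[f"x{i}" for i in range(10)]))
# With noise-free labels the printed tree should test only x0, x1, and x2:
# the features that actually appear in the target expression.
```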

Assignment

Explore the impact of the following on the extent of overfitting:

What to turn in

For each of the parameters mentioned above, vary its value and plot learning curves for training and testing accuracy: the parameter's value on the horizontal axis and accuracy on the vertical axis. For each parameter, write up an explanation of its impact on overfitting. Does its value affect overfitting? How significant is the effect? Why do you think the parameter has the observed effect? In each case, also display at least one decision tree and explain what is making it overfit.

Here is an example of generating a learning curve for a fixed-size dataset where the fraction of instances used for training is varied. You can use it as a template for your own learning curves. NOTE: this example varies the size of the training set; for each of the parameters above, modify it to vary that parameter's value instead.
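The template cell is not reproduced here; a sketch of what it might look like follows. The dataset construction is the same assumed boolean-expression generator used above, not the notebook's actual code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Fixed-size noisy boolean dataset (target concept and noise rate assumed)
X = rng.integers(0, 2, size=(2000, 10))
y = (X[:, 0] & X[:, 1]) | X[:, 2]
flip = rng.random(len(y)) < 0.2
y = np.where(flip, 1 - y, y)

# Vary the fraction of instances used for training from 10% to 90%
fractions = np.arange(0.1, 0.95, 0.1)
train_accs, test_accs = [], []
for frac in fractions:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=float(frac), random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    train_accs.append(accuracy_score(y_tr, clf.predict(X_tr)))
    test_accs.append(accuracy_score(y_te, clf.predict(X_te)))

plt.plot(fractions, train_accs, marker="o", label="train")
plt.plot(fractions, test_accs, marker="o", label="test")
plt.xlabel("fraction of data used for training")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

To adapt this for the assignment, hold the training fraction fixed and loop over values of the parameter being studied instead.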